Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability. (arXiv:2104.01408v2 [cs.CL] UPDATED)
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent
years. However, the emotion of the generated speech is often not perceptually
identifiable as its intended category. To address this problem, we propose a new
interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to
directly improve the emotion discriminability by interacting with a speech
emotion recognition (SER) model. Moreover, we formulate an iterative training
strategy with reinforcement learning to ensure the quality of i-ETTS
optimization. Experimental results demonstrate that the proposed i-ETTS
outperforms state-of-the-art baselines by rendering speech with a more
accurate emotion style. To the best of our knowledge, this is the first study of
reinforcement learning in emotional text-to-speech synthesis.
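
The general idea of rewarding a generator with a recognizer's judgment can be
illustrated with a toy policy-gradient (REINFORCE) loop. This is only a minimal
sketch under loud assumptions, not the i-ETTS implementation: the "synthesizer"
is reduced to a softmax policy over a handful of discrete styles, and
`toy_ser_reward` is a hypothetical stand-in for an SER model that simply checks
whether the sampled style matches the target emotion. All names and
hyperparameters below are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_ser_reward(sampled_style, target_emotion):
    """Hypothetical stand-in for an SER model (assumption, not from the paper).

    Returns 1.0 if the sampled style is "recognized" as the target emotion,
    else 0.0. A real SER model would score synthesized audio instead.
    """
    return 1.0 if sampled_style == target_emotion else 0.0


def reinforce_train(num_styles=4, target_emotion=2, steps=2000, lr=0.1, seed=0):
    """Train a toy softmax policy with REINFORCE and a moving-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * num_styles  # policy parameters, one logit per style
    baseline = 0.0               # running estimate of the average reward

    for _ in range(steps):
        probs = softmax(logits)
        # Sample a style from the current policy.
        action = rng.choices(range(num_styles), weights=probs)[0]
        reward = toy_ser_reward(action, target_emotion)

        # Advantage uses the baseline from previous steps only.
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)

        # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
        for k in range(num_styles):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob

    return softmax(logits)


if __name__ == "__main__":
    final_probs = reinforce_train()
    print([round(p, 3) for p in final_probs])
```

In this toy setting the policy concentrates its probability mass on the target
style because only samples the stand-in recognizer accepts receive a positive
reward. The moving-average baseline is a common variance-reduction choice and
is an assumption of this sketch.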